A Geometric Approach to Mapping Bitext Correspondence

Author

  • I. Dan Melamed
Abstract

The first step in most corpus-based multilingual NLP work is to construct a detailed map of the correspondence between a text and its translation. Several automatic methods for this task have been proposed in recent years. Yet even the best of these methods can err by several typeset pages. The Smooth Injective Map Recognizer (SIMR) is a new bitext mapping algorithm. SIMR's errors are smaller than those of the previous front-runner by more than a factor of 4. Its robustness has enabled new commercial-quality applications. The greedy nature of the algorithm makes it independent of memory resources. Unlike other bitext mapping algorithms, SIMR allows crossing correspondences to account for word order differences. Its output can be converted quickly and easily into a sentence alignment. SIMR's output has been used to align more than 200 megabytes of the Canadian Hansards for publication by the Linguistic Data Consortium.

1. Introduction

The first step in most corpus-based multilingual NLP work is to construct a detailed map of the correspondence between a text and its translation (a bitext map). Several automatic methods have been proposed for this task in recent years. However, most of these methods address only the sub-problem of alignment (Catizone et al. 1989, Brown et al. 1991, Gale & Church 1991, Debili & Sammouda 1992, Simard et al. 1992, Kay & Röscheisen 1993, Wu 1994).
Alignment algorithms assume the availability of text unit boundary information, and their output has less expressive power than a general bitext map. The only published solution to the more difficult general bitext mapping problem (Church 1993) can err by several typeset pages. Such frailty can expose lexicographers and terminologists to spurious concordances, feed noisy training data into statistical translation models, and degrade the performance of corpus-based machine translation. Some multilingual NLP tasks, such as automatic validation of terminological consistency (Macklovitch 1995) and automatic detection of omissions in translations (implemented for the first time in Melamed 1996), have been technologically impossible until now, because they are highly sensitive to large errors in the bitext map.

The Smooth Injective Map Recognizer (SIMR) is a greedy algorithm for mapping bitext correspondence. SIMR borrows several insights from previous work. Like Gale & Church (1991) and Brown et al. (1991), SIMR relies on the high correlation between the lengths of mutual translations. Like char_align (Church 1993), SIMR infers bitext maps from likely points of correspondence between the two texts, points that are plotted in a two-dimensional space of possibilities. Unlike previous methods, SIMR searches for only a handful of points of correspondence at a time.

Each set of correspondence points is found in two steps. First, SIMR generates a number of possible points of correspondence between the two texts, as described in Section 3.1. Second, SIMR selects those points whose geometric arrangement most resembles the typical arrangement of true points of correspondence. This selection involves localized pattern recognition heuristics, which Section 3.2 refers to collectively as the chain recognition heuristic. SIMR then interpolates between successive selected points to produce a bitext map, as described in Section 3.3.

2. Definitions

Several key terms will help to explain SIMR. First, a bitext (Harris 1988) comprises two versions of a text, such as a text in two different languages. Translators create a bitext each time they translate a text. Second, each bitext defines a rectangular bitext space, such as Figure 1. The width
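
The final step named in the introduction, interpolating between successive selected points of correspondence, can be illustrated with a minimal sketch. This is not SIMR's implementation: the excerpt says only that SIMR "interpolates", so the use of straight-line (linear) interpolation, the function name `interpolate_bitext_map`, and the reading of coordinates as positions in the two texts are all assumptions made for illustration.

```python
from bisect import bisect_right


def interpolate_bitext_map(points, x):
    """Estimate the position in the second text that corresponds to
    position x in the first text.

    points: (x, y) points of correspondence that have already been selected.
    Linear interpolation is an assumption of this sketch, not a detail
    taken from the source.
    """
    if len(points) < 2:
        raise ValueError("need at least two points of correspondence")
    pts = sorted(points)
    xs = [p[0] for p in pts]
    # Pick the segment whose left endpoint is the last point at or before x.
    # Clamping means positions outside the selected points extend the
    # nearest segment.
    i = min(max(bisect_right(xs, x) - 1, 0), len(pts) - 2)
    (x0, y0), (x1, y1) = pts[i], pts[i + 1]
    if x1 == x0:
        # Two points share an x-coordinate; fall back to the first one.
        return float(y0)
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical points of correspondence, for illustration only.
pts = [(0, 0), (100, 110), (200, 190)]
print(interpolate_bitext_map(pts, 50))   # 55.0
print(interpolate_bitext_map(pts, 150))  # 150.0
```

Sorting by the first coordinate keeps the lookup simple. How the selected points are generated and filtered (Sections 3.1 and 3.2 of the paper) is outside the scope of this sketch.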

Similar resources

A Portable Algorithm for Mapping Bitext Correspondence

The first step in most empirical work in multilingual NLP is to construct maps of the correspondence between texts and their translations (bitext maps). The Smooth Injective Map Recognizer (SIMR) algorithm presented here is a generic pattern recognition algorithm that is particularly well-suited to mapping bitext correspondence. SIMR is faster and significantly more accurate than other algorith...

Automatic Detection of Omissions in Translations

ADOMIT is an algorithm for Automatic Detection of OMIssions in Translations. The algorithm relies solely on geometric analysis of bitext maps and uses no linguistic information. This property allows it to deal equally well with omissions that do not correspond to linguistic units, such as might result from word-processing mishaps. ADOMIT has proven itself by discovering many errors in a hand-co...

Models of Co-occurrence

A model of co-occurrence in bitext is a boolean predicate that indicates whether a given pair of word tokens co-occur in corresponding regions of the bitext space. Co-occurrence is a precondition for the possibility that two tokens might be mutual translations. Models of co-occurrence are the glue that binds methods for mapping bitext correspondence with methods for estimating translation models ...

Bitext Maps and Alignment via Pattern Recognition

Texts that are available in two languages (bitexts) are becoming more and more plentiful, both in private data warehouses and on publicly accessible sites on the World Wide Web. As with other kinds of data, the value of bitexts largely depends on the efficacy of the available data mining tools. The first step in extracting useful information from bitexts is to find corresponding words and/or tex...

Journal title:
  • CoRR

Volume cmp-lg/9609009

Publication date 1996